Identifying Biomarkers from Mass Spectrometry Data with Ordinal Outcome

Summary: In recent years, there has been an increased interest in using protein mass spectroscopy to identify molecular markers that discriminate diseased from healthy individuals. Existing methods are tailored towards classifying observations into nominal categories. Sometimes, however, the outcome of interest may be measured on an ordered scale. Ignoring this natural ordering results in some loss of information. In this paper, we propose a Bayesian model for the analysis of mass spectrometry data with ordered outcome. The method provides a unified approach for identifying relevant markers and predicting class membership. This is accomplished by building a stochastic search variable selection method within an ordinal outcome model. We apply the methodology to mass spectrometry data on ovarian cancer cases and healthy individuals. We also utilize wavelet-based techniques to remove noise from the mass spectra prior to analysis. We identify protein markers associated with being healthy, having low grade ovarian cancer, or being a high grade case. For comparison, we repeated the analysis using conventional classification procedures and found improved predictive accuracy with our method.


Introduction
In recent years, technologic developments have spurred interest in using protein mass spectroscopy to identify molecular markers for discriminating between phenotypic groups [1]. The diagnostic categories often consist of tumor versus normal tissues, different types of malignancies, and subtypes of a specifi c cancer. Several variable selection methods have been developed to address this problem [2,3,4]. These procedures are tailored towards classifi cation into nominal categories. In some cases, however, the outcome of interest may have an ordered scale. Examples of variables with a natural ordering include the stage or grade of a tumor and quantitative clinical factors such as white blood cell counts. Applying methods designed for nominal variables to such problems may not be optimal since the information about the ordering is ignored. Chu et al. [5] have recently proposed a gene selection algorithm based on Gaussian processes to identify expression patterns associated with ordinal phenotypic outcomes in DNA microarray data.
We analyzed surface -enhanced laser desorption/ionization time-of-fl ight (SELDI-TOF) mass spectrometry data from a proteomic discovery and biomarker validation study for ovarian cancer conducted at the National Cancer Institute [6]. In ovarian cancer, more than two-thirds of cases are detected at an advanced stage, resulting in poor overall fi ve -year survival rates of 10-30% [7]. This is in stark contrast to stage I/IIa patients with 95% fi ve -year survival [7]. Cancer antigen 125 (CA -125) is the most widely used biomarker for ovarian cancer. However, it does not have adequate sensitivity and specifi city to be used as a screening tool. Even in conjunction with transvaginal sonography, the positive predictive value of CA -125 is only about 20% [8]. Protein mass spectroscopy has been used previously to identify markers that may improve the diagnostic performance of existing markers for early detection of ovarian cancer [1,9]. In this paper, we aimed to identify proteomic markers that are related to an ordinal measure of disease severity defi ned in terms of tumor grade.
Mass spectrometry data are inherently noisy. A pre-processing step is needed before any analysis. Several algorithms have been developed to this end [10,11,12]. Here, we adopt a pre-processing approach that uses wavelet techniques to remove noise from the mass spectra. We then propose a Bayesian variable selection method for classifying individuals into ordinal categories and apply the method to the processed mass spectra. In our approach the ordered outcomes are related to the protein levels using a data aug mentation approach. The variable selection procedure is built into the model through a latent binary inclusion/exclusion vector. Markov chain Monte Carlo (MCMC) stochastic search techniques are used to update this latent vector and to explore the prohibitively large space of possible predictor combinations. Posterior inference identifi es discriminating variables and predicts the ordered group membership of a sam ple via Bayesian model averaging. This allows us to account for the uncertainty inherent in the model selection process.
We compare our prediction results with those obtained from commonly used classification methods, such as linear discriminant analysis (LDA), quadratic discriminant analysis (QDA), k-nearest neighbor (KNN), and support vector machines (SVM). These methods, unlike our Bayesian model, build a multi class classifi er that ignores the natural ordering of the outcome. Moreover, with the exception of SVM, which provides a relevance measure for each variable, these procedures do not perform selection of the discriminating markers.

Experimental Data
Serum samples collected at the Mayo Clinic between 1980 and 1989 were analyzed by surfaceenhanced laser desorption and ionization time -offl ight (SELDI-TOF) mass spectrometry using the CM10 chip type [13]. The ProteinChip Biomarker System (Ciphergen Biosystems) was used for protein expression profi ling. Serum samples were analyzed by scientists blinded to disease status at Ciphergen Biosystems. Information on subjects included patient's age at diagnosis, CA-125 levels, and stage and grade for all the cancer cases. A detailed description of the samples and exclusion criteria can be found in [6]. In this paper, we focus on 50 samples obtained after 1986 whose serum was freeze-thawed a single time. They consist of 10 individuals free of ovarian cancer as well as cases with tumors graded as "well differentiated" (n = 5), "moderately differentiated" (n = 6), "poorly differentiated" (n = 13), and "undifferentiated" (n = 16). We defi ned three ordinal classes based on tumor grade: Z = 0 for controls, Z = 1 for well or moderately differentiated tumor, and Z = 2 for poorly or undifferentiated tumor.

Methods
Pre-processing of mass spectrometry profi les Protein mass spectra are inherently noisy and require substantial pre-processing before analysis. A mass spectrum can be represented as a curve where the x-axis indicates the ratio of a particular molecule's weight to its electrical charge (m/z) and the y-axis represents a signal intensity corresponding to the abundance of the molecule in the sample. Most peaks in the spectrum are associated with proteins or peptides and constitute important features. The goal of the analysis is often to identify peaks related to specifi c outcomes, such as different malignancies or clinical responses. Before proceeding to the data analysis, a number of pre-processing steps, such as removal of baseline and noise, normalization and calibration of samples, are needed. The procedures to perform these steps are still experimental and no standard has yet been established. The pre-processing steps we used are described below and summarized in Figure 1.

Baseline correction
This step is required to remove the ion overload and chemical noise that are usually higher at smaller m/z values. There is no general solution to this problem because baseline characteristics vary from one experiment to another and each spectrum has to be assessed individually. For the data considered in this paper, the baseline subtraction algorithm implemented in the BioConductor PROcess package (www.bioconductor.org) was used. This function splits a spectrum into a number of exponentially growing regions, calculates the quantiles in each region, and smoothes the results using the loess function.
Noise removal by wavelet methods Wavelets are families of orthonormal bases that can be used to parsimoniously represent functions. Following the seminal work of Donoho and Johnstone [14], wavelet thresholding has successfully been used in various applications to remove noise and recover the true signal intensities [15]. This is accomplished by applying a wavelet transform to the data and mapping wavelet coeffi cients that fall below a threshold to 0 (hard thresholding) or shrinking all coeffi cients toward 0 (soft thresholding). One can also opt between a universal or an adaptive thresholding rule. The former applies the same threshold, i.e. identical cut-off value or same amount of shrinkage for all wavelet coefficients, whereas the latter uses a threshold that depends on the resolution level of the wavelet coeffi cients. An inverse wavelet transform is then applied to the thresholded coeffi cients leading to a smoothed estimate of the function. We discarded m/z values lower than 2,000 due to large noise and m/z values greater than 15,000 because all the intensities in this range were very low. For the remaining data, we interpolated the mass spectra on a grid of equally spaced m/z values with 500,000 equi-spaced points using piecewise cubic splines. We noticed better qualitative denoising with undecimated transforms over standard decimated discrete wavelet transforms (DWT). These transforms do not impose restrictions on the length of the signal and are shiftinvariant, i.e. they are not affected by the starting position of the signal. We used the maximum overlap discrete wavelet transforms (MODWT) [16] with Daub(4) along with an adaptive soft thresholding rule. Figure 2 displays spectra plot after baseline correction and noise removal Normalization When dealing with multiple spectra it is a good practice to remove effects from systematic variation among spectra due to varying amounts of protein or to variation in the detector sensitivity. For this we used a global normalization procedure where mass intensities are scaled by a common factor. For a given peak in a given spectrum we computed the area under this peak, i.e. the sum of all intensities, from all spectra. We then defi ned the constant factor as the ratio of the area under this peak and the median of areas of all peaks.
Peak identifi cation A crucial step for the identifi cation and quantifi cation of proteins in mass spectra is to fi nd m/z values that correspond to peak intensities. We used the peak detection methods implemented in the PROcess library from BioConductor with the default settings. For each spectrum, peaks were identifi ed as m/z values with signal intensities satisfying the following criteria: 1) the intensity exceeds a specifi ed threshold value; 2) the intensity exceeds a constant times the median absolute deviation estimate of noise in a given window; 3) the intensity is a local maximum within a given window; 4) the ratio of the area under the peak, i.e. the sum of the intensities within a bandwidth, versus the maximum area among all peaks is greater than a pre-specifi ed constant.

Peak alignment
Mass spectra exhibit shifts along the horizontal axis between replicate spectra. In general, the instruments have an accuracy of 0.1 to 0.3% on the m/z scale. Thus, detected peaks that have masses within the percentage accuracy are considered identical. We merged peaks that have m/z measurements within 0.2% of each other and assigned the new peak the average m/z values and the maximum intensity.

Probit model for ordinal outcomes
Let (Z,X ) denote the observed data, where Z n × 1 is the vector of ordered categorical outcomes and X n × p is the matrix of covariates. In our setting, X contains the intensities at given m/z values. The responses Z i take one of J values, 0, ..., J -1. Each outcome Z i is associated with a vector (p i , 0 , ..., p i , J -1 ), where p i , j = P(Z i = j) is the probability that subject i falls in the ordered class j. The probabilities p i , j can be related to the linear predictor x i β by adopting a data augmentation approach [17]. We assume that there exists a latent continuous random variable Y i , such that (1) where α is an intercept parameter, β is a p × 1 vector of regression coeffi cients and σ 2 is set to 1 to make the model identifi able. The correspondence between the observed outcome Z i and the latent variable Y i is defi ned by where the boundaries δ j are unknown and -∞ = δ 0 < δ 1 < ... < δ J -1 < δ J = ∞.

Identifying biomarkers for ordinal outcome
Incorporating variable selection into the model Without loss of generality, we assume in the sequel that X has been centered, so that its columns sum to zero. Thus, rank (X ) ≤ min(n -1, p).
In our application, most of the predictors provide no information about the outcome of interest. In order to identify the informative predictors, we introduce a latent binary inclusion/exclusion vector γ that induces a mixture prior on the regression coeffi cients. We specify conjugate priors for the intercept α ∼N (α 0 , h) and the regression coeffi cients of the included variables β γ ∼ N (β 0γ , H γ ). The simplest form for the prior of γ is to assume its elements to be independently and identically distributed Bernoulli random variables, π(γ) = w p γ (1 -w) p-p γ , where w is the proportion of variables expected a priori to be related to the outcome and p γ is the number of included variables. This prior can be relaxed and more uncertainty can be introduced by assuming a further beta prior on w.
The method we propose here for variable selection is closely related to the approach presented in Sha et al. [4] for multinomial probit models. In this context, however, the correspondence between Z i and Y i uses different boundaries that account for the natural ordering of the outcome. In addition, here Y is a vector that follows a truncated normal distribution, whereas in the multi-group classifi cation case Y is a matrix and follows a truncated multivariate-t distribution. The resulting Gibbs sampler is therefore computationally less demanding in the ordinal setting.

Hyperparameter settings
A vague prior can be specifi ed on the intercept parameter α by setting h large, so that the value ascribed to the prior mean becomes irrelevant. We set α 0 = 0 and β 0 = 0. For a given γ, the prior on β depends on the matrix H γ . Brown et al. [18] discuss relative merits and drawbacks of different specifications. Here we use H = cI, which is easier to calibrate. The parameter c regulates the amount of shrinkage in the model. In general, we want to avoid very small values of c which cause too much regularization and large values that can induce nonlinear shrinkage as a result of Lindley's paradox [19]. In Sha et al. [4], we provided some guidelines on how to choose this hyperparameter in the context of probit models for classifi cation into nominal groups. We suggest using similar guidelines here. Specifi cally, we recommend choosing c such that the ratio of prior to posterior precision is relatively small. In practice, values of c that provide good mixing of the MCMC sampler, with 25-50% distinct visited models are appropriate. For the boundary parameters, we need to impose one constraint to ensure identifi ability; without loss of generality we take δ 1 = 0. For the remaining boundaries, we assign diffuse priors that express no prior belief by setting δ j to be uniformly distributed on (δ j -1 , δ j + 1 ).

Model fi tting
The prior beliefs are then updated with information from the data. We perform posterior inference using Markov chain Monte Carlo (MCMC) techniques. The model fi tting can be made more efficient by integrating out the parameters α and β. The MCMC sampler starts from a set of arbitrary parameters and the following steps are iterated: 1) Update the latent vector Y from its posterior distribution given (γ, δ, X, Z ), which is a truncated normal density under the constraints defi ned in equation (2) Y |(γ, δ, X, Z) ∼N δ (1 n α 0 + X γ ß 0γ , P γ ), (3) where P γ = I n + h1 n 1΄ n + X γ H γ X´γ , 1 n is an n × 1 vector of ones, I n is an n × n identity matrix.
2) Update the latent variable selection vector γ from its conditional posterior distribtion π(γ|Y, δ, X, Z ) ∝ π(γ) . π(Y|γ, δ, X, Z ). (4) This is accomplished using a Metropolis algorithm as in Sha et al. [4]. In this approach, the sampler visits a sequence of models that differ successively in one or two variables. At a generic step, a candidate model, γ new , is generated by randomly choosing among a set of transition moves. These moves consist of adding or deleting a variable by choosing one of the γ k 's (k = 1, ..., p) and changing its value, or swapping the status of two variables by choosing independently and at random a 0 and a 1 and exchanging their values. The proposed γ new is accepted with a probability that depends on the ratio of the relative posterior probabilities of the 3) Update the boundary parameters δ j from their posterior densities given (γ, X, Z, δ (-j) ), where δ (-j) is the vector δ without the j-th element. These conditional distributions are uniform on the interval [max{max{Y i : Z i = j -1},δ j-1 }, min{min{Y i : Z i = j},δ j + 1 }], as described in Albert and Chib [17].

Posterior inference
The MCMC procedure results in a list of visited variable subsets, γ, as well as sampled δ and Y vectors with their corresponding relative posterior probabilities. In order to draw posterior inference, we fi rst need to impute the latent vector Y, which can be viewed as missing data. Let Ŷ and δ be the estimates obtained by averaging respectively over the sampled Y and δ vectors. The normalized conditional probabilities π(γ|Ŷ, δ, X, Z ), which identify promising variable subsets, can be computed for all distinct vectors γ visited by the MCMC sampler. The marginal posterior probabilities of inclusion for single variables, π(γ k = 1|Ŷ, δ, X, Z ), k = 1, ..., p, can also be derived from these posterior probabilities. Inference on class prediction can be done in various ways. If a further set of observations is available for validation, least squares prediction based on a single "best" model can be computed: where γ is the vector with highest posterior probability, X γ consists of the covariates selected by γ, α = Ŷ, β γ = (X ' γ X γ + H γ -1 ) -1 X ' γ Ŷ. Alternatively, we can use Bayesian model averaging over a set of a posteriori likely models to estimate Y f : The ordered categorical outcomes can then be predicted using the correspondence In situations where the sample size is limited, which is typical in genomic and proteomic experiments, dividing the data into a training and a validation set may not be possible. In such cases, one can resort to sampling-based methods for crossvalidation prediction [20]. A cross-validation predictive distribution for sample i can be calculated using π(γ,Y, δ| X, Z ) as importance sampling density for π(γ, Y, δ| X, Z (-i) ), where Z (-i) is the outcome vector Z without the i-th element: . Y t ( ) and x i γ (t) are sample i's mesurements for the variables selected by γ (t) .The class membership of sample i can then be predicted by the mode of the predictive distribution: Figure 2 displays the pre-processed mass spectra for three women randomly chosen from each of the three groups. Each spectrum represents the expression profi le of peptides defi ned by their m/z values. We note some clear differences between the three curves. We pre-processed the spectra as described in the Methods section. After applying the wavelet thresholding for noise removal, the peak identifi cation and alignment steps resulted in 39 peaks.

Results
We fi tted the ordinal probit model with variable selection to identify protein markers that discriminate among the three groups. We used a Bernoulli prior with 10 variables expected to distinguish the classes. We ran four MCMC chains with widely different starting values for 100,000 iterations each and discarded the fi rst half as burnin to eliminate dependence on the starting points. We considered several hyperparamater values for the covariance of the regression coeffi cients, with c ranging between 0.1 and 10. Although there was minimal effect on the overall results, we found that smaller values of c tended to allocate a couple more samples from 'low grade' into 'control', and for larger values of c a couple more samples from 'low grade' were being misclassifi ed as 'high grade'. Here, we report the results for c = 3. Each chain visited about 22,000 distinct models after the burn-in period. The majority of the visited models contained around 10 variables. The marginal probabilities of inclusion for single peaks for each of the four MCMC chains are shown in Figure 3. Indices with high posterior probabilities correspond to important markers that discriminate between the different groups. We note that despite the widely different starting models, similar regions are visited by the different MCMC runs and there is a good concordance among the four plots. We therefore drew posterior inference on the pooled output from the four MCMC chains. We considered variables with large marginal posterior probabilities as well as markers included in the "best" models, i.e. γ vectors with high joint posterior probabilities. The list of selected markers based on marginal probabilities of inclusion greater than 0.1 and based on the best model are reported in Table 1. We note that there is a good agreement between the results. The best model contain seven markers, which are denoted by asterisk characters. Six of these are also selected based on their marginal probabilities of inclusion. Figure 4 displays surface representations of single spectra in each of the three groups for m/z values between 2,000 and 15,000. The arrows on top of the graph indicate peaks that appeared in the best model. We note that they clearly distinguish the different groups. For comparison, we performed Kruskal-Wallis analysis of variance on each peak to identify those that are signifi cantly different between the three classes. There were 6 peaks with pvalues less than 0. We used the cross-validation prediction approach described in the Methods section to assess the predictive ability of the selected discriminants. The results are reported in Table 2. We obtained an overall misclassifi cation rate of 19/50 = 0.38. For comparison, we analyzed the data using common classifi cation methods, such as linear discriminant analysis (LDA), quadratic discriminant analysis (QDA), k-nearest neighbor (KNN), and support vector machines (SVM), which build multi-class classifi ers without taking the natural ordering of the response into account. In addition, except for SVM which gives a relevance measure for each variable, these methods do not provide a selection of the discriminating markers. For QDA, we obtained best results by fi rst applying principal component analysis (PCA) to the data and performing the discriminant analysis on 5 components. For KNN, we considered values of k ranging from 2 to 8 and we report the results for k = 3, which gave the lowest overall misclassifi cation rate. We note that all the procedures had higher error rates compared to our method. In particular, our approach performed much better in separating class I and class III, which correspond respectively to disease free and poorly-to non-differentiated samples. As we noted above, the standard classification approaches do not perform variable selection. A common practice in applying these classifi cation methods consists of fi rst running univariate tests to identify signifi cantly different variables. The selected subset of variables is then used in the classifi cation algorithm. We repeated the comparison with the standard classifi cation methods using this two-stage approach. For each of the standard classification procedures, we assessed their cross-validation errors by considering the spectra found to be differentially expressed across the three groups. This was achieved by running an analysis of variance and Identifying biomarkers for ordinal outcome selecting the spectra with p-values less than 0.1 at every leave-one-out prediction [21]-there were 4 to 11 variables selected. This approach resulted in higher misclassifi cation error rates for all the methods compared to their performance based on the whole spectral data.

Discussion
We have proposed a Bayesian approach for classifi cation problems with ordinal outcomes and high-dimensional predictor data. While MCMC techniques are generally computationally intensive, our method is fairly straightforward. Once we augment the data and introduce latent variables underlying the ordinal outcomes, the problem reduces to variable selection in linear model setting, with the additional requirements of updating the latent continuous variables and their boundaries. We have made our Matlab code for implementing this procedure available at www. stat.tamu.edu/mvannucci/webpages/codes.html. We have illustrated the performance of our method with an application to mass spectrometry data from an ovarian cancer study. The ordinal outcome groups consisted of a control group and two case groups defi ned in terms of tumor differentiation. The overall cross-validated prediction accuracy was close to 62%. Not surprisingly, most of the misclassifi ed samples were from the cases with well and moderately differentiated tumors, which would be expected to be diffi cult to capture. The prediction errors, however, could also be attributed to the relatively long storage time of the samples, which may have laid to degradation of some proteins. Nonetheless, our method identifi ed 11 peaks as possible predictors. Several of those peaks correspond to proteins that have previously been shown to be associated with ovarian cancer. One of the predictive peaks, m/z value 3,271 we believe is inter-α tyrpsin inhibitor heavy chain 4 (ITIH4), which has been found to predict ovarian cancer by Zhang et al. [9], Fung et al. [22] and Song et al. [23]. However, our fi ndings are based on small number of samples in each group and need to be confi rmed in larger studies. Ordinal outcomes not only occur when dealing with tumor stages, but also in settings where one wishes to associate an environmental exposure with protein levels in serum or urine. For example, in an ongoing study, we are applying our method to mass spectrometry data obtained from urine samples of subjects with low, moderate and high levels of exposure to arsenic in drinking water. The identifi ed markers can subsequently aid in etiologic studies of arsenic exposure and cancer outcomes.
We have also proposed wavelet-based techniques for pre-processing the raw mass spectrometry data. We explored different choices of wavelet basis (Haar wavelets, Daubechies wavelets, least symmetric Daubechies wavelets) and different thresholding rules (hard versus soft and universal versus adaptive). In general, the universal hard threshold removes lots of coeffi cients and the universal soft threshold tends to attenuate some of the distinctive peaks. The adaptive soft thresholding approach, on the other hand, does a better job at preserving the peaks. We therefore used soft and adaptive wavelet thresholding to remove noise from the spectra. In the future, we plan to investigate alternative approaches, such as block shrinkage methods [24].